
Improve research search with Tantivy-backed snippets#152

Draft
akseljoonas wants to merge 1 commit into main from codex/tantivy-research-search-20260427

Conversation

@akseljoonas (Collaborator)

What this PR does

This replaces the old Whoosh-backed search inside ml-intern's research tools with a small Tantivy-based search layer. The goal is not to add RAG or embeddings; it is to make the existing research tools return more precise, source-addressable results so the agent spends fewer tokens finding the right docs or examples.

Whoosh is unmaintained and emits Python 3.12 warnings in local runs. More importantly, the old search ranked whole docs/pages and GitHub paths, so research calls often sent the model broad results instead of the exact useful passage.

User-visible behavior

  • explore_hf_docs now ranks markdown passages instead of whole pages. Results include the heading and line range for the matched section.
  • find_hf_api now uses the same Tantivy search layer for OpenAPI endpoint search.
  • github_find_examples still starts from example-like files, but now also indexes source snippets from public repo contents when a keyword is provided.
  • GitHub example results include exact github_read_file line ranges and focused excerpts around the query terms.
  • Public GitHub/HF docs search no longer hard-fails just because local auth is missing or a GitHub token is rejected. Auth is still used when it works.
  • Network-backed research data is cached on disk under .ml-intern-cache/search by default, or ML_INTERN_SEARCH_CACHE_DIR when set.

Implementation notes

  • Adds agent/search/ with:
    • TantivyTextIndex: small wrapper around tantivy for field-boosted BM25 search.
    • markdown/code chunking helpers with source line ranges.
    • JSON cache helpers for fetched docs, OpenAPI specs, repo trees, file contents, and prepared snippet docs.
  • Removes the Whoosh dependency and the Whoosh warning filter.
  • Skips raw .ipynb content indexing for now because notebook JSON produced noisy snippets and misleading line ranges; notebooks can still appear as path-level example results.
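The markdown chunking described above (passages with headings and source line ranges) could look roughly like this. This is a minimal sketch under my own assumptions, not the PR's implementation; `chunk_markdown` and the dict shape are illustrative names:

```python
def chunk_markdown(text: str) -> list[dict]:
    """Split markdown into heading-delimited sections with 1-based line ranges.

    Each chunk records its heading and the start/end source lines so a
    search hit can point back at the exact passage.
    """
    chunks: list[dict] = []
    current: dict | None = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            if current is not None:
                chunks.append(current)
            current = {
                "heading": line.strip("# ").strip(),
                "start": lineno,
                "end": lineno,
                "body": [],
            }
        elif current is not None:
            current["body"].append(line)
            current["end"] = lineno
    if current is not None:
        chunks.append(current)
    return chunks
```

In the real layer these chunks would then be fed to the Tantivy index as individual documents, so ranking happens at passage granularity rather than per page.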

Validation

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/test_tantivy_search.py tests/unit/test_docs_tantivy_search.py tests/unit/test_github_find_examples_tantivy.py -q
    • 11 passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run python -m compileall -q agent/search agent/tools/docs_tools.py agent/tools/github_find_examples.py
    • passed
  • Live tool checks:
  • explore_hf_docs on TRL with the query "dataset_text_field SFTConfig packing" returned the SFT / Packing section with source lines. A cached repeat took about 0.055s.
    • find_hf_api returned correct top endpoints for the queries "create repository", "upload file", and "space logs".
    • github_find_examples on huggingface/trl with the query "grpo trainer" returned focused source snippets; a cached repeat took about 0.031s.
  • Real CLI check:
    • ml-intern --max-iterations 6 --no-stream "Research current TRL GRPOTrainer usage..." naturally called explore_hf_docs, github_find_examples, fetch_hf_docs, and github_read_file, then returned a researched GRPOTrainer answer.

Known unrelated issue

The full unit suite currently reports two pre-existing failures in tests/unit/test_doom_loop.py: the tests still expect DOOM LOOP DETECTED while the runtime returns [SYSTEM: REPETITION GUARD]. This PR does not change that behavior.

Follow-up direction

This PR intentionally keeps scope to the search substrate. A natural next step is consolidating the research tools around a broader GitHub/HF interface, including model-accessible gh/hf CLI-style capabilities and more GitHub operations. The Tantivy layer here should give that future consolidation one shared precise search path instead of several independent ones.


This moves HF docs, HF OpenAPI, and GitHub example search onto a small Tantivy-backed search layer with passage/snippet chunking, source line ranges, and disk caches for network-backed research data. GitHub example lookup now searches file contents as well as paths, tolerates missing or rejected GitHub tokens for public repos, and returns focused snippets that the agent can follow up with github_read_file line ranges.
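The "focused snippet plus follow-up line range" flow above can be sketched in a few lines. This is an assumed illustration, not the shipped code; `focused_excerpt` and its return shape are hypothetical, and github_read_file is the real tool the range would feed:

```python
def focused_excerpt(source: str, terms: list[str], context: int = 2):
    """Return (excerpt, (start, end)) around the first line matching a term.

    The 1-based (start, end) range is what the agent could pass to a
    line-ranged follow-up read (e.g. github_read_file). Returns None when
    no term matches; matching is case-insensitive.
    """
    lines = source.splitlines()
    lowered = [ln.lower() for ln in lines]
    for i, ln in enumerate(lowered):
        if any(t.lower() in ln for t in terms):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            return "\n".join(lines[start:end]), (start + 1, end)
    return None
```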

Constraint: Keep the PR scoped to search quality and do not introduce RAG or embedding infra.
Rejected: Keep Whoosh and suppress warnings | leaves the stale dependency and weaker result granularity in place.
Rejected: Index raw notebooks as snippets | raw ipynb JSON produced noisy excerpts and misleading line ranges.
Confidence: high
Scope-risk: moderate
Directive: Treat this as the search substrate for future research-tool consolidation; broader gh/hf CLI exposure should build on this rather than reintroducing independent search paths.
Tested: uv run pytest tests/unit/test_tantivy_search.py tests/unit/test_docs_tantivy_search.py tests/unit/test_github_find_examples_tantivy.py -q
Tested: uv run python -m compileall -q agent/search agent/tools/docs_tools.py agent/tools/github_find_examples.py
Tested: live explore_hf_docs, find_hf_api, github_find_examples calls with cached follow-up timings
Tested: real ml-intern CLI research prompt exercised explore_hf_docs, github_find_examples, fetch_hf_docs, and github_read_file
Not-tested: Full unit suite has two pre-existing doom-loop wording assertion failures unrelated to search.
@fglogan
fglogan commented May 3, 2026

closed per maintainer request

